{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Solutions\n",
    "## Thought exercises\n",
    "1. Explore the Jupyter Lab interface and look at some of the shortcuts available. Don't worry about memorizing them now (eventually they will become second nature and save you a lot of time), just get comfortable using notebooks.\n",
    "2. Are all data normally distributed? \n",
    "> No. Even data that might appear to be normally distributed could belong to a different distribution. There are tests to check for normality, but this is beyond the scope of this book. You can read more [here](https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/).\n",
    "3. When would it make more sense to use the median instead of the mean for the measure of center? \n",
    "> When your data has outliers, it may make more sense to use the median over the mean as your measure of center.\n",
    "\n",
    "## Coding exercises\n",
    "### Exercise 4: Generate the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "random.seed(0)\n",
    "salaries = [round(random.random()*1000000, -3) for _ in range(100)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 5: Calculating statistics and verifying\n",
    "#### mean"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from statistics import mean\n",
    "\n",
    "sum(salaries)/len(salaries) == mean(salaries)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### median"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "def find_median(x):\n",
    "    x.sort()\n",
    "    midpoint = (len(x) + 1) / 2 - 1 # subtract 1 bc index starts at 0\n",
    "    if len(x) % 2:\n",
    "        # x has odd number of values\n",
    "        return x[int(midpoint)]\n",
    "    else:\n",
    "        return (x[math.floor(midpoint)] + x[math.ceil(midpoint)]) / 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from statistics import median\n",
    "\n",
    "find_median(salaries) == median(salaries)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### mode"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from statistics import mode\n",
    "from collections import Counter\n",
    "\n",
    "Counter(salaries).most_common(1)[0][0] == mode(salaries)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### sample variance\n",
    "Remember to use Bessel's correction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from statistics import variance\n",
    "\n",
    "sum([(x - sum(salaries)/len(salaries))**2 for x in salaries])/(len(salaries) - 1) == variance(salaries)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### sample standard deviation\n",
    "Remember to use Bessel's correction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from statistics import stdev\n",
    "import math\n",
    "\n",
    "math.sqrt(sum([(x - sum(salaries)/len(salaries))**2 for x in salaries])/(len(salaries) - 1)) == stdev(salaries)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 6: Calculating more statistics\n",
    "#### range"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "995000.0"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "max(salaries) - min(salaries)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### coefficient of variation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.45386998894439035"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from statistics import mean, stdev\n",
    "\n",
    "stdev(salaries)/mean(salaries)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### interquartile range"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "def quantile(x, pct):\n",
    "    x.sort()\n",
    "    index = (len(x) + 1) * pct - 1\n",
    "    if len(x) % 2:\n",
    "        # odd, so grab the value at index\n",
    "        return x[int(index)]\n",
    "    else:\n",
    "        return (x[math.floor(index)] + x[math.ceil(index)]) / 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum([x < quantile(salaries, 0.25) for x in salaries]) / len(salaries) == 0.25"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum([x < quantile(salaries, 0.75) for x in salaries]) / len(salaries) == 0.75"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "417500.0"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "q3, q1 = quantile(salaries, 0.75), quantile(salaries, 0.25)\n",
    "iqr = q3 - q1\n",
    "iqr"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### quartile coefficent of dispersion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.3417928776094965"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iqr / (q1 + q3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 7: Scaling data\n",
    "#### min-max scaling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0.0,\n",
       " 0.01306532663316583,\n",
       " 0.07939698492462312,\n",
       " 0.0814070351758794,\n",
       " 0.08944723618090453]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "min_salary, max_salary = min(salaries), max(salaries)\n",
    "salary_range = max_salary - min_salary\n",
    "\n",
    "min_max_scaled = [(x - min_salary) / salary_range for x in salaries]\n",
    "min_max_scaled[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### standardizing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[-2.199512275430514,\n",
       " -2.150608309943509,\n",
       " -1.9023266390094862,\n",
       " -1.8948029520114855,\n",
       " -1.8647082040194827]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from statistics import mean, stdev\n",
    "\n",
    "mean_salary, std_salary = mean(salaries), stdev(salaries)\n",
    "\n",
    "standardized = [(x - mean_salary) / std_salary for x in salaries]\n",
    "standardized[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 8: Calculating covariance and correlation\n",
    "#### covariance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0.07137603, 0.26716293],\n",
       "       [0.26716293, 1.        ]])"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "np.cov(min_max_scaled, standardized)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.26449129918250414"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from statistics import mean\n",
    "running_total = []\n",
    "for x, y in zip(min_max_scaled, standardized):\n",
    "    running_total.append((x - mean(min_max_scaled)) * (y - mean(standardized)))\n",
    "\n",
    "cov = mean(running_total)\n",
    "cov"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Pearson correlation coefficient ($\\rho$)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9900000000000001"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from statistics import stdev\n",
    "cov / (stdev(min_max_scaled) * stdev(standardized))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
